An Optimized Matrix Multiplication on ARMv7 Architecture
Abstract
A well-optimized matrix multiplication on embedded systems can speed up data processing in high-performance mobile measuring equipment, since many core mathematical algorithms are built on matrix multiplication. In this paper, we propose a matrix multiplication optimized specifically for the ARMv7 architecture. The simple implementation is blocked with the performance-critical differences between ARMv7 and conventional desktop/server architectures taken into account. The Advanced SIMD (Single Instruction, Multiple Data) engine NEON is additionally exploited to increase arithmetic performance and decrease memory access latency. Experimental results demonstrate that the proposed scheme is 7 to 20 times faster than the simple implementation and outperforms popular algorithms and open-source libraries.
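The abstract does not show the kernel itself, so the following is only a minimal sketch of the general idea of blocking a simple matrix multiplication, not the authors' implementation. The function names (`matmul_naive`, `matmul_blocked`), the square row-major layout, and the block size parameter `bs` are assumptions made for illustration. It is plain scalar C and uses no NEON intrinsics.

```c
#include <assert.h>
#include <stddef.h>

/* Simple triple loop: C = A * B for n x n row-major matrices. */
void matmul_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

/* Hypothetical helper: smaller of two sizes, used to clip edge tiles. */
static size_t min_size(size_t x, size_t y)
{
    return x < y ? x : y;
}

/* Blocked (tiled) variant: works on bs x bs tiles so that the data
 * touched by the inner loops is more likely to stay in cache.
 * bs is an illustrative tuning parameter, not a value from the paper. */
void matmul_blocked(size_t n, size_t bs,
                    const float *a, const float *b, float *c)
{
    assert(bs > 0);

    for (size_t i = 0; i < n * n; i++)
        c[i] = 0.0f;

    for (size_t ii = 0; ii < n; ii += bs) {
        size_t i_end = min_size(ii + bs, n);
        for (size_t kk = 0; kk < n; kk += bs) {
            size_t k_end = min_size(kk + bs, n);
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t j_end = min_size(jj + bs, n);
                /* Multiply one tile of A by one tile of B and
                 * accumulate into the matching tile of C. */
                for (size_t i = ii; i < i_end; i++) {
                    for (size_t k = kk; k < k_end; k++) {
                        float aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

The innermost loop walks `b` and `c` with unit stride, which is the access pattern a SIMD unit such as NEON would typically vectorize. A suitable `bs` depends on the cache sizes of the target core and would need to be measured.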
Similar resources
A New Parallel Matrix Multiplication Method Adapted on Fibonacci Hypercube Structure
The objective of this study was to develop a new optimal parallel algorithm for matrix multiplication that could run on a Fibonacci Hypercube structure. Most of the popular algorithms for parallel matrix multiplication cannot run on a Fibonacci Hypercube structure; a method that can run on all structures, especially the Fibonacci Hypercube structure, is therefore necessary for parallel matr...
Binary field multiplication on ARMv8
In this paper, we show efficient implementations of binary field multiplication over ARMv8. We exploit an advanced 64-bit polynomial multiplication (PMULL) supported by ARMv8 and conduct multiple levels of asymptotically faster Karatsuba multiplication. Finally, our method conducts binary field multiplication within 57 clock cycles for B-251. Our proposed method on ARMv8 improves the performanc...
Implementing GCM on ARMv8
The Galois/Counter Mode is an authenticated encryption scheme which is included in protocols such as TLS and IPSec. Its implementation requires multiplication over a binary finite field, an operation which is costly to implement in software. Recent processors have included instructions aimed to speed up binary polynomial multiplication, an operation which can be used to implement binary field m...
Algorithm of Automatic Parallelization of Generalized Matrix Multiplication
Parallelization of generalized matrix-matrix multiplication is crucial for achieving the high performance required in many situations. Parallelization performed by contemporary compilers is not sufficient to replace expert-tuned multi-threaded implementations or to get close to their performance. All competitive solutions require previously optimized external implementations that cannot b...
pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures
We have developed pOSKI: the Parallel Optimized Sparse Kernel Interface – an autotuning framework to optimize Sparse Matrix Vector Multiply (SpMV) performance on emerging shared memory multicore architectures. Our autotuning methodology extends previous work done in the scientific computing community targeting serial architectures. In addition to previously explored parallel optimizations, we f...